weekdays <- c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")
class( weekdays )
[1] "character"
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
[7] "Sunday"
Defining Categorical Types
In this brief presentation, we’ll be introducing the following items:
Unique and individual grouping that can be applied to a study design.
character type
The function sample() allows us to take a random sample of elements from a vector of potential values.
However, if we want a large number of items, we can draw them with or without replacement; asking for more items than the vector contains only works with replacement.
We’ll pretend we have a bunch of data related to the day of the week.
Length Class Mode
40 character character
[1] "Wednesday" "Sunday" "Monday" "Friday" "Thursday" "Friday"
[7] "Saturday" "Sunday" "Wednesday" "Friday" "Saturday" "Monday"
[13] "Friday" "Tuesday" "Wednesday" "Tuesday" "Sunday" "Monday"
[19] "Friday" "Tuesday" "Tuesday" "Monday" "Monday" "Saturday"
[25] "Tuesday" "Tuesday" "Wednesday" "Sunday" "Wednesday" "Wednesday"
[31] "Saturday" "Monday" "Friday" "Thursday" "Friday" "Wednesday"
[37] "Sunday" "Friday" "Tuesday" "Saturday"
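A character vector like the one printed above could be produced along the following lines. This is a minimal sketch: the variable name `data`, the seed, and the size of 40 (read off the summary output) are assumptions, and the actual draws will differ from the values shown.

```r
# The seven possible values to draw from
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")

# Arbitrary seed (an assumption) so the draw is repeatable
set.seed(42)

# 40 draws from 7 values is only possible with replacement
data <- sample(weekdays, size = 40, replace = TRUE)

class(data)    # still a character vector
summary(data)  # reports Length, Class, and Mode
```

Calling sample() with size = 40 and the default replace = FALSE would stop with an error, because only seven distinct values are available.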
factor
[1] Wednesday Sunday Monday Friday Thursday Friday Saturday
[8] Sunday Wednesday Friday Saturday Monday Friday Tuesday
[15] Wednesday Tuesday Sunday Monday Friday Tuesday Tuesday
[22] Monday Monday Saturday Tuesday Tuesday Wednesday Sunday
[29] Wednesday Wednesday Saturday Monday Friday Thursday Friday
[36] Wednesday Sunday Friday Tuesday Saturday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Each factor variable is defined by the levels that constitute the data. This is a finite set of unique values.
If a factor is not ordinal, it does not allow the use of relational comparison operators.
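To make the two points about levels and comparisons concrete, here is a small sketch. The vector `days` and its values are made up for illustration.

```r
# A small character vector of categories (illustrative values)
days <- c("Wednesday", "Sunday", "Monday", "Friday", "Monday")

# Convert to a factor; the levels are the unique values in the data
f <- factor(days)
levels(f)     # sorted alphabetically by default
nlevels(f)    # number of distinct levels

# Equality tests work on any factor
is_monday <- f == "Monday"

# Relational operators on an unordered factor give NA plus a warning;
# the warning is silenced here only to keep the output tidy
result <- suppressWarnings(f[1] < f[2])
result
```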
Where ordination matters:
Fertilizer Treatments in kg of N2 per hectare: 10 kg N2, 20 kg N2, 30 kg N2
Days of the Week: Friday is not followed by Monday
Life History Stage: seed, seedling, juvenile, adult, etc.
Where ordination is irrelevant:
River
State or Region
Sample Location
[1] Wednesday Sunday Monday Friday Thursday Friday Saturday
[8] Sunday Wednesday Friday Saturday Monday Friday Tuesday
[15] Wednesday Tuesday Sunday Monday Friday Tuesday Tuesday
[22] Monday Monday Saturday Tuesday Tuesday Wednesday Sunday
[29] Wednesday Wednesday Saturday Monday Friday Thursday Friday
[36] Wednesday Sunday Friday Tuesday Saturday
7 Levels: Friday < Monday < Saturday < Sunday < Thursday < ... < Wednesday
The problem is that the default ordering is actually alphabetical!
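The following sketch shows what the alphabetical default means in practice. The vector `days` is illustrative and not part of the original data.

```r
# Illustrative values, not the original data
days <- c("Wednesday", "Sunday", "Monday", "Friday", "Monday")

# Ask for an ordered factor without saying what the order is
f <- factor(days, ordered = TRUE)
levels(f)    # alphabetical order

# Comparisons now run, but they follow the alphabetical level order:
# Friday sorts before Monday, even though it comes later in the week
friday_before_monday <- f[4] < f[3]
friday_before_monday
```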
Specifying the Order of Ordinal Factors
[1] Wednesday Sunday Monday Friday Thursday Friday Saturday
[8] Sunday Wednesday Friday Saturday Monday Friday Tuesday
[15] Wednesday Tuesday Sunday Monday Friday Tuesday Tuesday
[22] Monday Monday Saturday Tuesday Tuesday Wednesday Sunday
[29] Wednesday Wednesday Saturday Monday Friday Thursday Friday
[36] Wednesday Sunday Friday Tuesday Saturday
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
[1] Monday Monday Monday Monday Monday Monday Tuesday
[8] Tuesday Tuesday Tuesday Tuesday Tuesday Tuesday Wednesday
[15] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday Thursday
[22] Thursday Friday Friday Friday Friday Friday Friday
[29] Friday Friday Saturday Saturday Saturday Saturday Saturday
[36] Sunday Sunday Sunday Sunday Sunday
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
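Output like that shown above comes from passing the desired order through the `levels` argument and then sorting. A minimal sketch, using a short made-up vector `days` rather than the original 40 values:

```r
# The intended order of the levels
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")

# Illustrative values, not the original data
days <- c("Wednesday", "Sunday", "Monday", "Friday", "Monday")

# Supply the level order explicitly
f <- factor(days, ordered = TRUE, levels = weekdays)
levels(f)

# Comparisons now follow the calendar order
friday_after_monday <- f[4] > f[3]

# Sorting also respects the level order
sorted <- sort(f)
as.character(sorted)
```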
You cannot assign a value to a factor that is not one of the pre-defined levels.
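A short sketch of what happens when you try. The value "Bob" is just a stand-in for anything that is not a level.

```r
f <- factor(c("Monday", "Tuesday", "Monday"))

# "Bob" is not one of the levels, so R stores NA and issues a warning
# (silenced here only to keep the output tidy)
suppressWarnings(f[1] <- "Bob")
rejected <- is.na(f[1])
rejected

# The levels themselves are unchanged
levels(f)

# Assigning an existing level works as usual
f[1] <- "Tuesday"
f
```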
forcats
The forcats library is part of the tidyverse group of packages.
This library has a lot of helper functions that make working with factors a bit easier. I’m going to give you a few examples here but strongly encourage you to look at the cheat sheet for all the other options.
We can reorder levels by order of appearance, by number of observations, or by a numeric variable.
By Frequency
[1] "Friday" "Tuesday" "Wednesday" "Monday" "Saturday" "Sunday"
[7] "Thursday"
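As a rough sketch of two of these reorderings, the example below uses fct_infreq() and fct_inorder() from forcats on a short made-up vector. It assumes the forcats package is installed.

```r
library(forcats)

# Illustrative values: Friday x3, Tuesday x2, Monday x1
days <- c("Tuesday", "Friday", "Friday", "Monday", "Tuesday", "Friday")
f <- factor(days)
levels(f)    # alphabetical by default

# Most frequent level first
by_freq <- fct_infreq(f)
levels(by_freq)

# Order in which the values first appear in the data
by_appearance <- fct_inorder(f)
levels(by_appearance)
```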
iris Data
Ronald A. Fisher was a British polymath, mathematician, statistician, geneticist, and academic. He developed methods such as:
F test, Exact test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Question: What are the mean and variance in sepal length for each of the Iris species?
The by() function allows us to perform some function on data based upon a grouping index.
by()
Here we can apply the function mean() to the data on sepal length using the species factor as a category.
by()
The same approach works for estimating the variance.
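One way these two calls might look, using the built-in iris data. The result names `mean_length` and `var_length` are chosen here for illustration.

```r
# Mean sepal length within each species
mean_length <- by(iris$Sepal.Length, iris$Species, mean)
mean_length

# Variance of sepal length within each species
var_length <- by(iris$Sepal.Length, iris$Species, var)
var_length
```

The first argument is the data, the second is the grouping factor, and the third is the function applied to each group.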
Missing data is a fact of life and R is very opinionated about how it handles missing values. Where this becomes tricky is when we are doing operations on data that has missing values. R could take two routes:
Fortunately, R took the second route.
If there is even ONE NA, most mathematical operations will return NA instead of an answer.
by()
You’ll have to do the same thing when using by().
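The sketch below shows the usual way around this, the `na.rm = TRUE` argument, first on a plain vector and then through by(). The copy `iris_na` with one value blanked out is made up for the illustration.

```r
# A single NA is enough to make the result NA
x <- c(1, 2, NA, 3)
mean(x)
mean(x, na.rm = TRUE)   # drop the NA before averaging

# Copy of iris with one missing sepal length (for illustration only)
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

# Without na.rm the group containing the NA has no answer
with_na <- by(iris_na$Sepal.Length, iris_na$Species, mean)

# Extra arguments to by() are passed on to the function
without_na <- by(iris_na$Sepal.Length, iris_na$Species, mean, na.rm = TRUE)
without_na
```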
A common workflow consists of taking some data and performing several operations on it before we do some kind of analysis, summary, plot, or table. It can be
This causes a lot of data duplication of the intermediate steps, extra typing, etc. Remember we strive for minimal effort!
The Treachery of Images
In R we use this grammar.
The pipe operator %>% takes the values in data and passes them to the function Y() as if you had entered data as its first argument, so data %>% Y() is equivalent to Y(data).
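To compare the styles side by side, here is a small sketch that computes the same rounded mean three ways. It assumes the magrittr package is installed; the variable names are illustrative.

```r
library(magrittr)

# Nested calls have to be read from the inside out
nested <- round(mean(iris$Sepal.Length), 2)

# Intermediate variables add typing and keep extra copies around
avg <- mean(iris$Sepal.Length)
stepwise <- round(avg, 2)

# Piped: each result becomes the first argument of the next function
piped <- iris$Sepal.Length %>% mean() %>% round(2)
piped
```

All three give the same value; the piped version reads left to right in the order the operations happen.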
The magrittr library is part of the tidyverse group of packages, so it is usually easier to just load the tidyverse.
Here is an operation that we’ve used as summary( iris ) in the past, but it can be used in a pipe like this.
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
knitr + table -> kable
The knitr library has some nice basic functionality to make tables.
The table should have the species names and the average length and width of sepals.
Make a new data frame and set the First Column as species.
Use the by() function to estimate mean length and width
So, now we’ll use our new pipe operator to pass the data into the kable() function (n.b., look at ?kable and see that the first argument is the data, which is being substituted by the pipe).
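Putting those steps together might look like the sketch below. The data frame name `sepal_means` and the column names are assumptions chosen to match the table headings, and the knitr and magrittr packages are assumed to be installed.

```r
library(knitr)
library(magrittr)

# New data frame: species names first, then the per-species means from by()
sepal_means <- data.frame(
  Species = levels(iris$Species),
  Length  = as.numeric(by(iris$Sepal.Length, iris$Species, mean)),
  Width   = as.numeric(by(iris$Sepal.Width, iris$Species, mean))
)

# The pipe supplies sepal_means as the first argument of kable()
tbl <- sepal_means %>% kable()
tbl
```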
The kableExtra library has a lot more functionality that can be added to the table.
position = "float_right"
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit libero sit amet porta elementum. In imperdiet tellus non odio porttitor auctor ac sit amet diam. Suspendisse eleifend vel nisi nec efficitur. Ut varius urna lectus, ac iaculis velit bibendum eget. Curabitur dignissim magna eu odio sagittis blandit.
| Species | Length | Width |
|---|---|---|
| setosa | 5.006 | 3.428 |
| versicolor | 5.936 | 2.770 |
| virginica | 6.588 | 2.974 |
Vivamus sed ipsum mi. Etiam est leo, mollis ultrices dolor eget, consectetur euismod augue. In hac habitasse platea dictumst. Integer blandit ante magna, quis volutpat velit varius hendrerit. Vestibulum sit amet lacinia magna. Sed at varius nisl. Donec eu porta tellus, vitae rhoncus velit.
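A table floated to the right of the surrounding text, as in the layout above, could be requested roughly as follows. This is a sketch that assumes HTML output and that the knitr, kableExtra, and magrittr packages are installed; the data frame name `sepal_means` is illustrative.

```r
library(knitr)
library(kableExtra)
library(magrittr)

# Per-species sepal means (same values as the table above)
sepal_means <- data.frame(
  Species = levels(iris$Species),
  Length  = as.numeric(by(iris$Sepal.Length, iris$Species, mean)),
  Width   = as.numeric(by(iris$Sepal.Width, iris$Species, mean))
)

# Build an HTML table, then float it right so text wraps around it
floated <- sepal_means %>%
  kable(format = "html") %>%
  kable_styling(full_width = FALSE, position = "float_right")
```

Setting full_width = FALSE keeps the table narrow enough for the text to flow beside it.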